Measuring similarities/dissimilarities between atomic structures is important for the exploration of potential energy landscapes. However, the cell vectors together with the coordinates of the atoms, which are generally used to describe periodic systems, are not suitable as fingerprints to distinguish structures. Based on a characterization of the local environment of all atoms in a cell, we introduce crystal fingerprints that can be calculated easily and that define configurational distances between crystalline structures satisfying the mathematical properties of a metric. This distance between two configurations is a measure of their similarity/dissimilarity and in particular allows structures to be distinguished. The new method is a useful tool within various energy landscape exploration schemes, such as minima hopping, random search, swarm intelligence algorithms and high-throughput screenings.

I. INTRODUCTION

Large data sets of crystalline structures are nowadays available in two major contexts. On the one hand, databases of materials have been created containing structural information of both experimental and theoretical compounds from high-throughput calculations, which are the basis for data-mining techniques in materials discovery projects. On the other hand, ab initio structure prediction can produce a huge number of new structures that have either not yet been found experimentally or are metastable. In both cases it is essential to quantify similarities and dissimilarities between structures in the data sets, requiring a configurational distance that satisfies the properties of a metric. Databases frequently contain duplicates and insufficiently characterized structures which need to be identified and filtered. In experimental data, the representations of identical structures obtained from different experiments will always differ slightly due to noise in the measurements, such that the configurational distance is never exactly zero. Noise is also present in theoretical calculations where, for instance, a geometry relaxation is stopped once a certain, possibly insufficient, convergence threshold is reached. In ab initio structure prediction schemes it is typically necessary to maintain some structural diversity, which can be quantified as a certain minimal configurational distance. All these examples clearly show the need for a metric that measures configurational distances and compares local structures in a reliable and efficient way.

Crystalline structures are typically given in a dual representation. The first part specifies the cell and the second part the atomic positions within the cell. The former can for instance be given by the three lattice vectors $\mathbf{a}$, $\mathbf{b}$ and $\mathbf{c}$, or by their lengths $a$, $b$ and $c$ and the intermediate angles $\alpha$, $\beta$ and $\gamma$. The atomic positions can be specified either by Cartesian coordinates or by reduced coordinates with respect to the lattice vectors. However, such representations are not unique, since any choice of lattice points can serve as cell vectors of the same crystalline structure. Unique and preferably standardized cell parameters are required for comparison and analysis of different crystals. Algorithms to transform unit cells to a reduced form are frequently used in crystallography, such as the Niggli reduction, which produces cells with the shortest possible vectors ($|\mathbf{a}| + |\mathbf{b}| + |\mathbf{c}|$ = minimal). Unfortunately, in the presence of noisy lattice vectors, cells can change discontinuously within the Niggli-reduction algorithm. Symmetry analysis and the corresponding classification into the 230 crystallographic space groups is another tool to compare crystal structures. However, the outcome of a symmetry analysis algorithm strongly depends on a tolerance parameter, such that the introduction of some noise can change the resulting space group in a discontinuous manner. Because of the problems described above, it is difficult to quantify similarities based on dual representations.

Within the structure prediction community, fingerprints that are not based on such a dual representation have been proposed. Oganov et al. introduced element-resolved radial distribution functions as a crystal fingerprint. For a crystal containing one element, only a single function is obtained for the entire system. The difference between the radial distribution functions of two crystals is then taken as the configurational distance. By definition the radial distribution function contains only radial information, but no information about the angular distribution of the atoms. Such angular information has been added in the bond characterization matrix (BCM) fingerprint. In this fingerprint spherical harmonics and exponential functions are used to set up modified bond-orientational order metrics of the entire configuration. The distance between two configurations can be measured by the Euclidean distance between their BCMs. Atomic environment descriptors are also needed in the context of machine learning schemes for force fields, bonding pattern recognition, or to compare vacancy, interstitial and intercalation sites. These descriptors could also be used to measure similarities between structures. Even though they have never been used in this context, we will present a comparison with such a descriptor.

When humans decide by visual inspection whether two structures are similar, they typically proceed in a different way. They try to find matching atoms which have the same structural environment. If all the atoms in one structure can be matched with the atoms of the other structure, the two structures are considered to be identical. Such a matching approach based on the Hungarian algorithm has already turned out to be useful for the distinction of clusters.

In this paper we will present a fingerprint for crystalline structures which is based on such a matching approach. The environment of each atom is described by an atomic fingerprint which is calculated in real space for an infinite crystal and represents some kind of environmental scattering properties observed from the central atom. Therefore, all the ambiguities of a dual representation do not enter into the fingerprint, allowing an efficient and precise comparison of structures.


II. FINGERPRINT DEFINITION 

Recently we have proposed a configurational fingerprint for clusters. In this approach an overlap matrix is calculated for an atom-centered Gaussian basis set. The vector formed by the eigenvalues of this matrix is a global fingerprint that characterizes the entire structure. The Euclidean norm of the difference between the fingerprint vectors of two structures is the configurational distance between them and satisfies the properties of a metric.

Since there is no unique representation of a crystal by a group of atoms (e.g. the atoms in some unit cell), we will use atomic fingerprints instead of global fingerprints in the crystalline case. However, this atomic fingerprint is closely related to our global fingerprint for non-periodic systems. For each atom $k$ in a crystal located at $\mathbf{R}_k$ we obtain a cluster of atoms by considering only those contained in a sphere centered at $\mathbf{R}_k$. For this cluster we calculate the overlap matrix elements $S^k_{i,j}$ as described in our work on clusters for a non-periodic system, i.e. we put on each atom one or several Gaussian type orbitals and calculate the resulting overlap integral. The orbitals are indexed by the letters $i$ and $j$, and the index $\alpha(i)$ gives the index of the atom on which the Gaussian $G_i(\mathbf{r})$ is centered, i.e.

$$S^k_{i,j} = \int d\mathbf{r}\, G_i(\mathbf{r} - \mathbf{R}_{\alpha(i)})\, G_j(\mathbf{r} - \mathbf{R}_{\alpha(j)}) \qquad (1)$$
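For the simplest case the overlap integral above has a closed form, which the following minimal sketch evaluates. It assumes one normalized s-type Gaussian $\exp(-r^2/(2\sigma^2))$ per atom with a common width `sigma`; the function name `s_overlap_matrix` and the example coordinates are hypothetical and only serve as illustration.

```python
import math

def s_overlap_matrix(positions, sigma=1.0):
    """Overlap matrix of normalized s-type Gaussians exp(-r^2 / (2 sigma^2)).

    For two such Gaussians whose centers are a distance d apart, the
    overlap integral evaluates to exp(-d^2 / (4 sigma^2)).
    """
    n = len(positions)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(positions[i], positions[j]))
            S[i][j] = math.exp(-d2 / (4.0 * sigma ** 2))
    return S

# Illustrative three-atom cluster (arbitrary length units).
positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
S = s_overlap_matrix(positions, sigma=1.0)
```

The diagonal elements equal one because the Gaussians are normalized, and the matrix is symmetric by construction. Orbitals of p type or element-dependent widths would need the corresponding more general overlap formulas.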

In this first step, the amplitudes $c_{\mathrm{norm}}$ of the Gaussians are chosen such that the Gaussians are normalized to one. To avoid discontinuities in the eigenvalues when an atom enters or leaves the sphere, we construct in a second step another matrix $T^k$ such that

$$T^k_{i,j} = f_c(|\mathbf{R}_{\alpha(i)} - \mathbf{R}_k|)\, S^k_{i,j}\, f_c(|\mathbf{R}_{\alpha(j)} - \mathbf{R}_k|) \qquad (2)$$

where $\alpha(i)$ again denotes the atom on which orbital $i$ is centered. The cutoff function $f_c$ smoothly goes to zero on the surface of the sphere with radius $\sqrt{2n}\,\sigma_c$:

$$f_c(r) = \left(1 - \frac{r^2}{2 n \sigma_c^2}\right)^n \qquad (3)$$

In the limit where $n$ tends to infinity the cutoff function converges to a Gaussian of width $\sigma_c$. The characteristic length scale $\sigma_c$ is typically chosen to be the sum of the two largest covalent radii in the system.
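The behavior of the cutoff function can be checked numerically with a short sketch. It assumes the polynomial form $f_c(r) = (1 - r^2/(2n\sigma_c^2))^n$, which vanishes at $r = \sqrt{2n}\,\sigma_c$ and tends to a Gaussian of width $\sigma_c$ for large $n$; the function names are hypothetical.

```python
import math

def cutoff(r, sigma_c, n=3):
    """Polynomial cutoff (1 - r^2 / (2 n sigma_c^2))^n, zero beyond sqrt(2n) * sigma_c."""
    r_cut = math.sqrt(2.0 * n) * sigma_c
    if r >= r_cut:
        return 0.0
    return (1.0 - r * r / (2.0 * n * sigma_c ** 2)) ** n

def gaussian(r, sigma_c):
    """Limiting Gaussian of width sigma_c."""
    return math.exp(-r * r / (2.0 * sigma_c ** 2))

# With n = 3 and sigma_c = 1 the sphere radius is sqrt(6).
values = [cutoff(r, 1.0) for r in (0.0, 1.0, 2.0, math.sqrt(6.0))]
```

Increasing `n` enlarges the sphere and makes the function approach `gaussian(r, sigma_c)`, at the price of more atoms inside the sphere.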

The value $n$ determines how many derivatives of the cutoff function are continuous on the surface of the sphere, and $n = 3$ was used in the following. One can consider the modified matrix $T^k$ to be the overlap matrix of the cluster where the amplitude of the Gaussian at atom $i$ is determined by $c_{\mathrm{norm}} f_c(|\mathbf{R}_i - \mathbf{R}_k|)$. In this way atoms close to the surface of the sphere give rise to very small eigenvalues of $T^k$ and are thus weighted less than the atoms closer to the center. The eigenvalues of this matrix are sorted in descending order and form the atomic fingerprint vector $\mathbf{V}_k$. Since we cannot predict exactly how many atoms will be in the sphere, we estimate a maximum length for the atomic fingerprint vector. If the number of atoms is too small to generate enough eigenvalues to fill the entire vector, the remaining entries at the end of the fingerprint vector are set to zero. This also guarantees that the fingerprint is a continuous function of the atomic positions when atoms enter or leave the sphere. If an atom enters the sphere, some zeros towards the end of the fingerprint vector are transformed continuously into very small entries which contribute only little to the overall fingerprint. The Euclidean norm $|\mathbf{V}_k - \mathbf{V}_l|$ measures the dissimilarity between the atomic environments of atoms $k$ and $l$.
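The construction of an atomic fingerprint can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes one s-type Gaussian of common width `sigma` per atom, the polynomial cutoff $(1 - r^2/(2n\sigma_c^2))^n$, and that the caller has already generated the periodic images around the central atom. The function name and all default parameters are hypothetical.

```python
import numpy as np

def atomic_fingerprint(rel_positions, sigma=1.0, sigma_c=2.0, n=3, length=20):
    """Sorted, zero-padded eigenvalues of the cutoff-weighted overlap matrix.

    rel_positions: (N, 3) positions relative to the central atom; the
    central atom itself is the row of zeros.
    """
    pos = np.asarray(rel_positions, dtype=float)
    r = np.linalg.norm(pos, axis=1)
    inside = r < np.sqrt(2.0 * n) * sigma_c        # keep atoms inside the sphere
    pos, r = pos[inside], r[inside]
    # Closed-form overlap of normalized s-type Gaussians of width sigma.
    d2 = np.sum((pos[:, None, :] - pos[None, :, :]) ** 2, axis=-1)
    S = np.exp(-d2 / (4.0 * sigma ** 2))
    fc = (1.0 - r ** 2 / (2.0 * n * sigma_c ** 2)) ** n
    T = fc[:, None] * S * fc[None, :]              # cutoff-weighted overlap matrix
    eig = np.sort(np.linalg.eigvalsh(T))[::-1]     # descending order
    if eig.size > length:
        raise ValueError("fingerprint length too small for this sphere")
    out = np.zeros(length)
    out[: eig.size] = eig                          # pad the tail with zeros
    return out

# Illustrative cluster: central atom plus two neighbors.
cluster = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0]]
fp = atomic_fingerprint(cluster)
```

Because only eigenvalues enter, the result does not depend on the order in which the atoms are listed, and an atom sitting just inside the sphere surface changes the vector only marginally.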

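The atomic fingerprints $\mathbf{V}^p_k$ and $\mathbf{V}^q_k$ of all the $N_{\mathrm{at}}$ atoms in two crystalline configurations $p$ and $q$ can now be used to define a configurational distance $d(p,q)$ between the two crystals:

$$d(p,q) = \min_P \left( \sum_{k=1}^{N_{\mathrm{at}}} \left| \mathbf{V}^p_k - \mathbf{V}^q_{P(k)} \right|^2 \right)^{1/2} \qquad (4)$$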

where $P$ is a permutation function which matches a certain atom $k$ in crystal $p$ with atom $P(k)$ in crystal $q$. The optimal permutation function which minimizes $d(p,q)$ can be found with the Hungarian algorithm in polynomial time. If the two crystals $p$ and $q$ are identical, the Hungarian algorithm will in this way assign corresponding atoms to each other. The Hungarian algorithm needs as its input only the cost matrix $C$ given by

$$C_{k,l} = \left| \mathbf{V}^p_k - \mathbf{V}^q_l \right|^2$$
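The minimization over permutations can be illustrated with a small, dependency-free sketch. Note that it swaps the Hungarian algorithm for an exhaustive search over all permutations, which gives the same minimum but is only practical for a handful of atoms; the function names and the example fingerprints are hypothetical.

```python
import itertools
import math

def squared_norm(u, v):
    """Squared Euclidean distance between two fingerprint vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def config_distance(fps_p, fps_q):
    """Root of the minimal summed cost over all atom-to-atom matchings."""
    if len(fps_p) != len(fps_q):
        raise ValueError("both cells must contain the same number of atoms")
    n = len(fps_p)
    # Cost matrix of squared fingerprint differences.
    cost = [[squared_norm(fps_p[k], fps_q[l]) for l in range(n)] for k in range(n)]
    best = min(
        sum(cost[k][perm[k]] for k in range(n))
        for perm in itertools.permutations(range(n))
    )
    return math.sqrt(best)

# Illustrative atomic fingerprints (made-up numbers, three atoms per cell).
p = [[1.0, 0.5], [0.9, 0.2], [0.7, 0.1]]
q = [p[2], p[0], p[1]]                      # same atoms, listed in another order
r = [[1.0, 0.4], [0.8, 0.2], [0.7, 0.3]]
d_pq = config_distance(p, q)
d_pr = config_distance(p, r)
```

For realistic cell sizes the same cost matrix would be handed to a polynomial-time assignment solver, for example `scipy.optimize.linear_sum_assignment`, instead of enumerating permutations.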

In the following it will be shown that $d(p,q)$ satisfies the properties of a metric, namely

• positiveness: $d(p,q) \ge 0$

• symmetry: $d(p,q) = d(q,p)$

• coincidence axiom: $d(p,q) = 0$ if and only if $p = q$

• triangle inequality: $d(p,r) + d(r,q) \ge d(p,q)$.

From the definition (Eq. (4)) it is obvious that the positiveness and symmetry conditions are fulfilled. The coincidence axiom is satisfied if the individual atomic fingerprints are unique, i.e. if there are no two different atomic environments that give rise to identical atomic fingerprints. In our work on fingerprints for clusters we have shown that the fingerprints can be considered to be unique if they have a length larger than or equal to 3 per atom. The triangle inequality can be established in this way:

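$$
\begin{aligned}
d(p,r) + d(r,q)
&= \left( \sum_{k}^{N_{\mathrm{at}}} \left| \mathbf{V}^p_k - \mathbf{V}^r_{P(k)} \right|^2 \right)^{1/2}
 + \left( \sum_{k}^{N_{\mathrm{at}}} \left| \mathbf{V}^r_{P(k)} - \mathbf{V}^q_{P'(P(k))} \right|^2 \right)^{1/2} \\
&\ge \left( \sum_{k}^{N_{\mathrm{at}}} \left| \mathbf{V}^p_k - \mathbf{V}^q_{P'(P(k))} \right|^2 \right)^{1/2} \\
&\ge \left( \sum_{k}^{N_{\mathrm{at}}} \left| \mathbf{V}^p_k - \mathbf{V}^q_{Q(k)} \right|^2 \right)^{1/2} = d(p,q)
\end{aligned}
$$

where $P$, $P'$ and $Q$ are assumed to be the permutations that minimize the Euclidean vector norms associated with $d(p,r)$, $d(r,q)$ and $d(p,q)$, respectively. The sum belonging to $d(r,q)$ has been reindexed by $P$, which leaves its value unchanged. The first inequality is the triangle inequality of the Euclidean norm, and the second holds because $Q$ is the minimizing permutation.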

III. CONTRACTED FINGERPRINTS 

Since the $\mathbf{R}_k$-centered spheres typically contain about 50 atoms, an atomic fingerprint has at least length 50 if only s-type Gaussian orbitals are used, or length 200 if both s and p orbitals are used. Since a configuration is characterized by the ensemble of the atomic fingerprints of all the atoms in the cell, the amount of data needed to characterize a structure is quite large, even though it is certainly manageable for crystals with a small number of atoms per unit cell. Storage requirements might however become too high in certain cases such as large molecular crystals. We will therefore introduce contraction schemes that considerably reduce the amount of data necessary to characterize a crystalline structure. Two such schemes are briefly discussed below.

A. Contractions by properties 

Let us introduce a function $\tau(i)$ that designates a certain property of the Gaussian orbital $i$ and encodes it in the form of a contiguous integer index. In the case of a multi-component crystal it can indicate on which kind of chemical element the Gaussian is centered and whether the orbital is of s or p type. The principal vector is thus chopped into pieces whose elements all carry the same value $\tau(i)$. In the following presentation of numerical results we have always considered the central atom to be special, independent of its true chemical type. Having $m$ atomic species in the unit cell and using atomic Gaussian orbitals with a maximum angular momentum $l_{\max}$, $\tau(i)$ runs from 1 to $(m+1)(l_{\max}+1)$. Now we can construct a contracted matrix

$$t^k_{\mu,\nu} = \sum_{i,j} \delta_{\mu,\tau(i)}\, u^k_i\, T^k_{i,j}\, u^k_j\, \delta_{\nu,\tau(j)}$$

together with its metric tensor

$$s^k_{\mu,\nu} = \sum_{i} \delta_{\mu,\tau(i)}\, u^k_i\, u^k_i\, \delta_{\nu,\tau(i)}$$

where $\mathbf{u}^k$ is the principal vector of the matrix $T^k$ of Eq. (2). The eigenvalues $\lambda$ of the generalized eigenvalue problem

$$t^k \mathbf{v} = \lambda\, s^k \mathbf{v}$$

form again an atomic fingerprint of length $(m+1)(l_{\max}+1)$, which is much shorter than the non-contracted fingerprint $\mathbf{V}_k$.
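A possible reading of this contraction is sketched below. It assumes that the principal vector is the eigenvector of the cutoff-weighted overlap matrix with the largest eigenvalue, that it is split into pieces by an integer label per orbital, and that the contracted matrix and metric tensor are the projections onto these pieces. The function name, the 0-based labels and the example data are hypothetical.

```python
import numpy as np

def contracted_fingerprint(T, tau):
    """Eigenvalues of t v = lambda s v built from the chopped principal vector.

    T   : (N, N) symmetric cutoff-weighted overlap matrix.
    tau : length-N integer labels 0..M-1 (0-based here), one per orbital;
          every label must occur at least once.
    """
    T = np.asarray(T, dtype=float)
    tau = np.asarray(tau)
    n_orb = T.shape[0]
    m = int(tau.max()) + 1
    u = np.linalg.eigh(T)[1][:, -1]           # principal vector (largest eigenvalue)
    W = np.zeros((n_orb, m))
    W[np.arange(n_orb), tau] = u              # column mu holds the piece with tau(i) == mu
    t = W.T @ T @ W                           # contracted matrix
    s_diag = np.sum(W * W, axis=0)            # metric tensor is diagonal: pieces do not overlap
    scale = 1.0 / np.sqrt(s_diag)
    # Rescaling by s^(-1/2) turns the generalized problem into a standard one.
    lam = np.linalg.eigvalsh(scale[:, None] * t * scale[None, :])
    return np.sort(lam)[::-1]

# Illustrative data: central atom (label 0) and three neighbors of one element (label 1).
pos = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [0.0, 1.5, 0.0], [0.0, 0.0, 1.5]])
d2 = np.sum((pos[:, None, :] - pos[None, :, :]) ** 2, axis=-1)
T_demo = np.exp(-d2 / 4.0)
tau_demo = [0, 1, 1, 1]
fp_short = contracted_fingerprint(T_demo, tau_demo)
```

Since the full principal vector lies in the space spanned by its pieces, the largest contracted eigenvalue coincides with the largest eigenvalue of the uncontracted matrix; the remaining entries summarize how the pieces couple.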

B. Contractions to form molecular orbitals for molecular 
crystals 

The fingerprints described so far can in principle also be used for molecular crystals. However, the amount of data needed to characterize such crystals can be quite large if the molecules forming the crystal contain many atoms. By creating molecular orbitals in analogy with standard methods in electronic structure calculations, the required amount of data can be considerably reduced. The eigenvalues arising from the overlap matrix in this molecular basis set will then form a fingerprint for the molecular crystal. The molecular orbitals can be obtained in the following way: for each molecule $k$ in our unit cell we cut out a cluster of molecules within a sphere of a certain radius. For each molecule $a$ in this sphere we set up the overlap matrix by putting Gaussian type orbitals on all its constituent atoms. Then we calculate the eigenvalues and eigenvectors of this matrix. The principal vectors $\mathbf{W}^{a,\mu}$ belonging to several of the largest eigenvalues $\lambda^a_\mu$ are subsequently used for the contraction:


$$S^k_{(a,\mu),(b,\nu)} = \sum_{i \in a} \sum_{j \in b} W^{a,\mu}_i\, S_{i,j}\, W^{b,\nu}_j \qquad (5)$$

No metric tensor is required since the set of vectors used for the contraction is orthogonal. The molecular orbitals have characteristic patterns: the orbital corresponding to the first principal vector has no nodes, while the orbitals of the following principal vectors have an increasing number of nodes. They are therefore similar to the atomic orbitals of s, p and higher angular momentum character, which were used for the fingerprints of the ordinary crystals. In Fig. 4 these orbitals are shown for the case of the paracetamol molecule.
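The nodal pattern of the principal vectors can be reproduced on a toy molecule. The sketch assumes one s-type Gaussian of common width `sigma` per atom, and the five-atom chain is an invented geometry, not paracetamol; the function name is hypothetical.

```python
import numpy as np

def molecular_principal_vectors(atom_positions, sigma=1.0, n_vectors=3):
    """Largest eigenvalues and matching eigenvectors of a molecule's overlap matrix."""
    pos = np.asarray(atom_positions, dtype=float)
    d2 = np.sum((pos[:, None, :] - pos[None, :, :]) ** 2, axis=-1)
    S = np.exp(-d2 / (4.0 * sigma ** 2))           # s-type Gaussian overlaps
    vals, vecs = np.linalg.eigh(S)                 # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_vectors]     # pick the largest ones
    return vals[order], vecs[:, order]

# Invented bent chain of five atoms (arbitrary length units).
chain = [[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [2.0, 0.9, 0.0],
         [3.2, 0.9, 0.0], [4.0, 0.0, 0.0]]
vals, W = molecular_principal_vectors(chain, n_vectors=3)
```

All overlap matrix elements are positive, so the first principal vector has components of a single sign (no nodes), while the vectors returned by the symmetric eigensolver are mutually orthonormal.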

By multiplying $S$ with a cutoff function as in Eq. (2) we then obtain molecule-centered overlap matrices in this molecular basis which are free of discontinuities with respect to the motion of the atoms. In the molecular case the value of the cutoff function depends on a short-range pseudo-interaction between the central and the surrounding molecules. This interaction $U_{k,a}$ between the central molecule $k$ and another molecule $a$ is given by


U k,a 



(6) 


where the sum over $i$ runs over all the atoms in the central molecule $k$ and the sum over $j$ over all the atoms in the surrounding molecule $a$. $d_{ij}$ is the distance between the atoms $i$ and $j$, and the second length entering the interaction is the sum of the van der Waals radii of the two atoms. The interaction is taken to vanish beyond its first zero. Because of the short range of the interaction, molecules sharing a large surface will be coupled strongly. The analytical form of the cutoff function is identical to the one for the atomic case (Eq. (3)). However, since a Cartesian distance between molecules is ill defined, the argument in Eq. (3) is modified. The scaled distance $r/\sigma_c$ between the atoms is replaced by the normalized interaction between the molecules

$$\frac{U_{k,a}}{U_{k,k}}$$

The eigenvalues of this final overlap matrix now form a fingerprint describing the environment of molecule $k$ with respect to the other molecules. To compare two structures this procedure is carried out for all molecules contained in the corresponding unit cells. A configurational distance is then calculated as in Eq. (4) by using the Hungarian algorithm.


IV. APPLICATION OF FINGERPRINT DISTANCES TO 
EXPERIMENTAL STRUCTURES 

Structural data found in various materials databases are frequently obtained from measurements at different temperatures, which results in thermal expansion. Similarly, measurements at different pressures or low-quality x-ray diffraction patterns can lead to slight cell distortions. Obviously our fingerprint distances among such expanded or distorted but otherwise identical structures are different from zero. For these reasons we have introduced a scheme where the six degrees of freedom associated with the cell are optimized while keeping the reduced atomic coordinates fixed, so as to obtain the smallest possible distance to a reference configuration. The gradient of our fingerprint distance with respect to the lattice vectors can be calculated analytically using the Hellmann-Feynman theorem. The lattice optimization scheme was applied to a subset of ZrO$_2$ structures taken from the Open Quantum Materials Database (OQMD), as discussed in further detail in the following section.


V. NUMERICAL TESTS 

Fig. 1 shows all possible pairwise configurational distances obtained with several fingerprints for various data sets. Different fingerprints are plotted along the x and y axes. LFP stands for the uncontracted long fingerprint, and in square brackets it is indicated whether only s or both s and p orbitals were used to set up the overlap matrix. SFP[s] stands for the short contracted fingerprint with s orbitals only, where the properties used for the contraction are the central atom and the element type of the neighboring atoms in the sphere. For materials that have only one type of element (Si in our case) the atomic fingerprint has only length two and the coincidence axiom is not satisfied. Even though there are hyperplanes in the configurational space where different configurations have identical fingerprints, it is very unlikely that different local minima lie on such hyperplanes, and the fingerprint can therefore nevertheless distinguish well between identical and distinct structures. If both s and p orbitals are used (SFP[sp]) the atomic fingerprint has at least length 4 and no problem with the coincidence axiom arises. In addition we also show the configurational distances arising from the Oganov and BCM fingerprints as well as from a fingerprint based on the amplitudes of symmetry functions. All our data sets contain both the global minimum (geometric ground state) and local minima (metastable structures), obtained from minima hopping runs. Energies and forces were calculated with the DFTB method for SiC and the molecular crystals, and the Lenosky tight-binding scheme was used for Si. For the CsPbI$_3$ perovskite and the transparent conductive oxide Zn$_2$SnO$_4$, plane wave density functional theory (DFT) calculations were used as implemented in the Quantum ESPRESSO code.

The first test set consists of clathrate-like structures of low-density silicon allotropes. Low-density silicon gives rise to a larger number of low-energy crystalline structures than silicon at the density of diamond silicon and thus poses an ideal benchmark system. In the first row of the figure we show the results of a relatively sloppy local geometry optimization, where the relaxation is stopped once the forces are smaller than $5 \times 10^{-2}$ eV/Å. Gaps separating identical from distinct structures are hardly visible for any of the fingerprints. Once a very accurate geometry optimization with a force threshold of $5 \times 10^{-3}$ eV/Å is performed, gaps become visible for all the fingerprints.
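When such a gap exists, it can also be located automatically rather than by eye. The following sketch is a hypothetical heuristic, not part of the method described here: it sorts the pairwise distances and returns the two values that bound the widest gap on a logarithmic scale. The distances in the example are made-up numbers for illustration only.

```python
import math

def largest_log_gap(distances):
    """Return (lower, upper) bounds of the widest gap between sorted log10 distances."""
    values = sorted(d for d in distances if d > 0.0)
    if len(values) < 2:
        raise ValueError("need at least two positive distances")
    logs = [math.log10(v) for v in values]
    widths = [b - a for a, b in zip(logs, logs[1:])]
    i = max(range(len(widths)), key=widths.__getitem__)
    return values[i], values[i + 1]

# Made-up pairwise distances: three near-duplicates and three distinct pairs.
distances = [1e-6, 3e-6, 2e-6, 0.05, 0.2, 0.11]
lower, upper = largest_log_gap(distances)
threshold = math.sqrt(lower * upper)   # geometric mean of the gap bounds
```

Pairs with a distance below `threshold` would then be treated as identical structures. If the widest gap is not much wider than the others, as for the loosely relaxed data above, no reliable threshold exists and the relaxation should be tightened first.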

The second data set is silicon carbide, a material well known for its large number of polytypes. Our fingerprint gives rise to a small gap, whereas the configurational distances based on all other fingerprints do not show any gap at all. The opening of a gap can again be observed once the geometry optimization is done with high accuracy. In this case all fingerprints result in a gap but, as for all test sets, it is least pronounced for the BCM fingerprint. Both the Oganov and BCM fingerprints are global ones, such that information is lost in the averaging process of these fingerprints as the system gets larger. Therefore, it is not surprising that the gap disappears again, even for the high-quality geometry optimization, once one goes to large cells.

The next two test sets consist of an oxide material and a perovskite with their characteristic building blocks of octahedra and tetrahedra, which can be arranged in a very large number of different ways. All our fingerprints give rise to clear gaps separating identical from distinct structures. The Oganov fingerprint also gives rise to clear gaps, whereas the BCM fingerprint only weakly indicates some gap. The Behler fingerprint gives a well pronounced gap for Zn$_2$SnO$_4$ but only a blurred gap for CsPbI$_3$.

The last theoretical test system is a platinum surface. In this case the energies were calculated with the Morse potential. The geometry optimizations were done with high accuracy and therefore a big gap is visible in all cases.

Fig. 2 shows the correlation between the energy difference and the fingerprint distance for all the test cases of Fig. 1. Except for the very large 256-atom system, there always exists a clear energy gap if the geometry optimization was done with high accuracy. Even though there is of course the possibility of nearly degenerate structures, this seems to happen rarely in practice, and energy is thus a rather good and simple descriptor for small unit cells.

To test our molecular fingerprint, two test systems were employed, namely crystalline formaldehyde and paracetamol. The formaldehyde system comprised 240 structures with 8 molecules per cell and the paracetamol system 300 structures with 4 molecules per cell. The two top panels of Fig. 3 show the molecular fingerprint distance versus the energy difference of different structures of formaldehyde and paracetamol, respectively. The two bottom panels show the correlation of the standard fingerprint against the molecular fingerprint for both systems. The existence of a gap in the pairwise distance distributions clearly indicates that identical and distinct structures can be identified by both fingerprints. However, the molecular fingerprint vector is considerably shorter because only six principal vectors were used (shown in Fig. 4). Since six is the number of degrees of freedom of a rigid rotator, it is expected that this fingerprint is long enough to satisfy the coincidence axiom.

Next we applied our fingerprint to the ZrO$_2$ structures contained in the OQMD. 115 different entries were available at this composition. The structures were either based on experimental data retrieved from the Inorganic Crystal Structure Database (ICSD) or on binary structural prototypes. When the OQMD was initially created, duplicate entries were identified with the structure comparison algorithm implemented in the Materials Interface (MINT) software package, which employs a 6-level test that includes cell reduction as well as an analysis of the lattice symmetry. Structures classified as identical to an existing entry in the OQMD were mapped to that entry without performing a structural relaxation. Therefore, the structural data set contains both DFT-optimized and experimental structures, resulting in noise on the atomic and cell coordinates arising from the numerical calculations as well as from the different experiments and thermal effects. In panel (a) of Fig. 5 we show the ordinary and the lattice-vector-optimized fingerprint distances for all 115 structures from the database. We can see that the fingerprint distance can be reduced to about $10^{-7}$ for many structures. For some of them the initial fingerprint distances were as large as 0.1. This makes it possible to detect identical structures whose large initial fingerprint distance was only due to thermal expansion. However, even with lattice vector optimization it was not possible to decide unambiguously for the whole data set which structures are identical and which are not. Therefore, local geometry optimizations were performed at the DFT level for all structures using the VASP code. A plane wave cutoff energy of 520 eV was used together with a dense k-point mesh. Both the atomic and cell variables were relaxed until the maximal force component was less than $2 \times 10^{-3}$ eV/Å and the stress below $10^{-2}$ GPa. Panel (b) of Fig. 5 shows the DFT energy differences of the relaxed structures against the fingerprint distances, showing a clear gap that distinguishes identical from different structures. Applying the lattice vector optimization scheme to these relaxed structures did not further lower the fingerprint distances of identical structures. The coloring in Fig. 5 indicates how the two structures belonging to a fingerprint distance were classified by MINT. Assuming that there are no different structures with degenerate DFT energies, one can conclude that MINT was not able to extract from the non-relaxed data set the information whether structures are identical or not, and has erroneously classified numerous identical structures as distinct and, to a lesser extent, distinct structures as identical.

Since both the Oganov and BCM methods are global fingerprints that discard crucial information, they can fail to describe structural differences, a problem that becomes especially apparent when considering defect structures in complex materials. As an example, a 2 x 2 x 2 supercell of the cubic perovskite structure of LaAlO$_3$ was constructed. Half of the Al atoms on the B-sites were replaced by Mn. Then, single oxygen vacancies were introduced on symmetrically inequivalent X-sites. The structural symmetry was thereby reduced from the initial space group $Pm\bar{3}m$ of LaAlO$_3$ to the orthorhombic space group $Amm2$ of the supercell La(Al,Mn)O$_3$, and the oxygen vacancies resulted in structures with $Cm$ and $Pm$ symmetry. Both MINT and our fingerprint confirm that the structures are clearly different, whereas the Oganov and BCM fingerprints erroneously classify both structures as identical.


VI. CONCLUSIONS 


Atomic fingerprints that describe the scattering properties as obtained from an overlap matrix are well suited to characterize atomic environments. An ensemble of atomic fingerprints forms a global fingerprint that identifies crystalline structures and defines configurational distances satisfying the properties of a metric. The widely used Oganov and BCM fingerprints do not have these properties and in practice also do not provide a reliable way to distinguish identical from distinct structures. Symmetry function based fingerprints are of similar quality to our scattering fingerprints. However, they are much more costly to calculate. Both fingerprints have a cubic scaling with respect to the number of atoms within the cutoff range, but our prefactor of the matrix diagonalization is much smaller than the prefactor for the 3-body terms required for the calculation of the symmetry functions. In contrast to ‘true’-‘false’ schemes such as employed in MINT, which rely on a threshold and affirm that two structures are either identical or distinct, our fingerprint gives a distance between configurations. The appearance of a gap in the distance distribution indicates that a reliable assignment of identical and distinct structures can be performed. In addition, strong reductions in the fingerprint distances upon lattice vector optimization can detect and eliminate thermal noise in the data set, rendering our fingerprint ideal to scan for duplicates in large structural databases. Our scheme can easily be extended to molecular crystals by introducing quantities that are analogous to molecular orbitals. Furthermore, the new fingerprint can be used to accurately explore local environments to create atomic and structural attributes for machine learning techniques. In summary, we have demonstrated that this approach characterizes crystalline structures by rather short fingerprint vectors and decides more reliably whether structures are identical or not than previously proposed methods.

FIG. 1. Correlation between different fingerprints for all the 8 test sets obtained during structure prediction runs. SFP[x] and LFP[x] indicate short and long fingerprints with x orbitals, respectively; “opt high” and “opt low” indicate the quality of the geometry relaxation.

FIG. 2. Correlation between the energy difference and the fingerprint distance for all the 8 theoretical test sets of Fig. 1.

FIG. 3. Top panels: Correlation between the energy difference and the molecular fingerprint distance (MFP) for formaldehyde (a) and paracetamol (b). Bottom panels: Correlation between molecular fingerprint distance and standard fingerprint distance (short contracted fingerprint with s orbitals only, SFP[s]) for formaldehyde (c) and paracetamol (d).

FIG. 4. The nodal character of the first six principal vectors for the paracetamol molecule. The atoms are colored according to the sign of the elements of the first six principal vectors $\mathbf{W}$. A systematic color pattern can be observed. The first principal eigenvector never changes sign and has therefore no nodes (a). Higher principal vectors exhibit more and more nodes (b-f).

FIG. 5. Panel (a) shows along the x-axis the ordinary fingerprint distances and along the y-axis the lattice optimized fingerprint distances for the ZrO$_2$ structures retrieved from the OQMD. Distances between two structures that were identified as identical by the structural comparison algorithm implemented in MINT are shown in red and structures that were identified as distinct are shown in blue. Panel (b) shows the correlation between the DFT energy differences among all relaxed structures and the ordinary fingerprint distances.